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Protein structure prediction can be shown to be an NP- 
hard problem; the number of conformations grows exponen- 
tially with the number of residues. The native conformations 
of proteins occupy a very small subset of these, hence an ex- 
ploratory, robust search algorithm, such as a genetic algo- 
rithm (GA), is required. The dynamics of GAs tend to be 
complicated and problem-specific. However, their empirical 
success warrants their further study. In this paper, guidelines 
for the design of genetic algorithms for protein structure pre- 
diction are determined. To accomplish this, the performance 
of the simplest genetic algorithm is investigated for simple 
lattice-based protein structure prediction models (which is ex- 
tendible to real-space), using energy minimization. The study 
has led us to two important conclusions for 'protein-structure- 
prediction-genetic-algorithms'. Firstly, they require high res- 
olution building blocks attainable by multi-point crossovers 
and secondly they require a local dynamics operator to 'fine 
tune' good conformations. Furthermore, we introduce a sta- 
tistical mechanical approach to analyse the genetic algorithm 
dynamics and suggest a convergence criterion using a quantity 
analogous to the free energy of a population. 
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I. INTRODUCTION 

Conceptually, proteins fold from their ID polymer 
chain of amino acids (primary structure) to 3D stable, 
'unique' conformations (tertiary structure). Anfmsen jlj 
showed that folding requires knowledge of the amino acid 
sequence alone; the determination of the native (biolog- 
ically functioning) structure from its sequence is known 
as the protein folding problem. Much research has gone 
into elucidating the folding dynamics g but a practical 
theory is still beyond our understanding. 

The folding problem attempts to understand the dy- 
namics of the folding - how the sequence of amino acids 
arrive at the native state. It is clear that a full solution to 
this problem would be able to predict the tertiary struc- 
ture of an amino acid sequence. Unfortunately, such a 
solution is beyond our understanding at present. 

However, amino acid sequences are effectively being 
determined at a higher rate than that of their corre- 
sponding native structures. Since knowledge of the native 
structure is important to understanding the function of 
a protein, a potentially more practical problem is pro- 
tein structure prediction. There is a possible confusion 
in terminology between the two terms, protein folding 
and protein structure prediction^ researchers use the 
terms loosely and interchangeably. The difference must 
be stressed as there are 'folding simulations', in the liter- 
ature, which carry out a conformational search in a man- 
ner that is not attributable to the true folding dynamics 
of proteins (e.g. (D). 

Protein folding mainly concerns the dynamics of the 
problem which experimentalists study using hydrogen- 
exchange techniques, for example; theorists often use 
computer simulations of simple models, often on lat- 
tices, to elucidate the folding dynamics. Protein struc- 
ture prediction, however, is only interested in the end re- 
sult; experimentalists often use crystallisation techniques 
coupled with x-ray diffraction to determine the tertiary 
structure, whereas theorists tend to use computer-based 
optimization methods. This is the approach discussed 
in this paper. The problem involves two aspects: (1) 
the specification of the function to optimize and (2) the 
choice of a search algorithm. 
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The prediction of protein structures using optimization 
methods have been more successful using comparative 
modelling techniques which include sequence align- 
ment, threading Q and the use of secondary structural 
propensities ||. However, a physical approach based on 
energy minimization is used here. It is clear that the in- 
teractions between the residues and the solvent molecules 
drive the protein towards its native state. Determining 
the fundamental interactions is important, not only for 
protein structure prediction, but also for the protein fold- 
ing problem. A further reason for the physical approach 
is that comparative modelling requires the known struc- 
ture of one or more homologous proteins - these cannot 
always be guaranteed to exist. 

The second aspect of the protein structure prediction 
problem is concerned with designing a search strategy 
for the energy minimization. It is clear that a blind 
search through the conformational space is impossible, 
as it would take a time greater than the age of the uni- 
verse. A similar 'paradox' for real proteins, based on ob- 
served protein folding times (~ 10~ 3 s), was first noted by 
Levinthal (7j. Furthermore, several authors have proved 
various models of protein structure prediction to be com- 
putationally NP-hard [] This implies that no effi- 
cient algorithm can be designed to guarantee finding the 
native state amongst the exponentially many. A related 
idea to the complexity of the conformational search is the 
concept of 'rugged' energy landscapes fi^fl . Although op- 
timal solutions are not guaranteed, robust, exploratory, 
non-deterministic search algorithms (e.g. simulated an- 
nealing 1 13 1 and genetic algorithms Q) can locate good, 
near-optimal solutions, within a reasonable time. In the 
present paper, we focus on the design of a good confor- 
mational search strategy for the problem, leaving the dis- 
cussion of the best choice of energy functions for protein 
structure prediction to a later paper. 

Genetic algorithms were invented by John Holland fl^ j 
in his quest for a theory of adaptive processes. The 
concept was inspired by Darwin's evolutionary theory 
(loosely 'survival of the fittest') and in particular 'neo- 
Darwinism' according to which genetic recombinations 
and mutations play a dominant role in the evolution of 
a species. It is generally believed that Nature evolves 
so that individuals that are the best adapted to their 
co-evolving environment survive, while the poor ones die 
off; this is an example of optimization, more commonly 
referred to as 'adaptation' in biology ]15p . Due to their 
highly nonlinear nature, genetic algorithms are difficult 
to analyse. There is no asymptotic global optimum con- 
vergence proof, as there is in simulated annealing [fl6|| , 
nor are there any general rules to design a GA for a spe- 
cific problem. However, their empirical success for the 



solution of numerous NP problems ( |l7[ and references 
therein) warrants their further study. 

Previous applications of genetic algorithms to the pro- 
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have 



tein structure prediction problem [18 - 
not considered the GA design issue. As in any problem, 
the simpler the algorithm is, the fewer the parameters it 
requires, the easier it is to understand and improve the 
performance. For this purpose we use a modified ver- 
sion of Goldberg's Q Simple Genetic Algorithm (SGA), 
written in C. Dandekar and Argos Jlq-|2(| also use a ver- 
sion of the SGA but their work differs to ours in that 
our objectives are different. Firstly, they optimize the 
structural features of proteins (such as a-helices and (3- 
strands) rather than adopting an energy minimization 
approach. Secondly, little comparison was made with 
the performance of other algorithms. Schulze-Kremer 
p2| wrote an elaborate genetic algorithm to optimize 
real-space dihedral angles for a fully atomistic represen- 
tation of proteins, based on CHARMM |25| energy mini- 
mization. Unfortunately, results were poor. Clearly, our 
principal research aim is to predict realistic structures 
as well, but only when a good method has been estab- 
lished. Unger and Moult ||,|l| compared a Monte Carlo 
search with a 'genetic algorithm', using simple exact lat- 
tice models. However, their genetic algorithm is, strictly 
speaking, a hybrid GA. It incorporates several Monte 
Carlo conformers with the occasional crossover between 
structures. For this reason, we call their approach a 'Ge- 
netic Algorithm-Monte Carlo' (GAMC). They compared 
this method with a Monte Carlo search and concluded 
that the GAMC found lower energy solutions. 

In this paper, we compare the Simple Genetic Algo- 
rithm with work by Unger and Moult ; Yue et al. pr^ ] 
and Sali et al. [ p7f and determine guidelines for designing 
protein-structure-prediction-genetic-algorithms. 

The paper is structured in the following way. Section 
H highlights the conformational search issue and the need 
for a genetic algorithm approach. Following that, section 
H| provides a description of the method for determining 
guidelines for GA design. Section IV discusses the Sim- 
ple Genetic Algorithm used for the lattice conformational 
search. The HP-model @ and REM-model ||§ are de- 
scribed in section [v|, as are the various methods used to 
minimize these potentials f 
tional search results from t 



|H^l],^7j . Lattice conforma- 

re SGA are compared with the 



other methods in section VI , while section VII provides a 



discussion of the 'protein structure prediction-genetic al- 
gorithm' (PSP-GA) design principles that have emerged 
from this work. 



II. WHY GENETIC ALGORITHMS FOR 
PROTEIN STRUCTURE PREDICTION? 



1 A good introduction to the theory of NP-completeness and 
NP- hardness can be found in [Hi. 



Protein structure prediction is analytically difficult to 
solve. The problem is thought to stem from the expo- 
nential nature of the conformational search space. The 
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2. A systematic sample search of the parameter space 
was carried out to determine optimal parameter 
values for the SGA. 

3. Having calibrated the SGA in step 2, we compared 
the SGA conformational search ability using several 
test energy functions. 

4. Conformations, and their corresponding energies, 
generated by the SGA were compared with confor- 
mations predicted by other search methods. 

5. The time evolution of conformations generated by 
the SGA were observed. Minimal requirements and 
improvements for PSP-GAs are proposed. 



number of conformations of a protein with N amino acid 
residues grows exponentially as 7^ where 7 is the aver- 
age number of conformations per residue (typically ~10). 
This suggests that an algorithm would require an expo- 
nential time to search the whole conformational space for 
the native state. 

However, problems with exponentially growing search 
spaces are not new in physics (e.g. ideal gases) but many 
are solvable due to the symmetries and conservation laws 
that can be exploited. With proteins there is an added 
difficulty that the interactions are complex - it is not clear 
whether there are enough symmetries to reduce the prob- 
lem to a tractable solution. In addition to this, proteins, 
although macromolecular, do not contain ~ 10 23 atoms 
to guarantee a valid statistical mechanical and thermo- 
dynamic treatment. 

Furthermore, work by Unger and Moult ||; Fraenkel 
jlO| ; Ngo and Marks suggests that the protein struc- 
ture prediction problem is NP-hard, that is computation- 
ally impossible to guarantee an exact solution. Bryngel- 
son, Wolynes et al. JO, for example, advocate that the 
energy landscape of proteins must be 'rugged'. This re- 
flects the various energy barriers that have to be crossed 
and thus the hurdles that a conformational search algo- 
rithm must be able to deal with. 

For these reasons 'intelligent' conformational search al- 
gorithms have become popular in structure prediction. 
Unlike gradient-based methods J3(J , which tend to termi- 
nate at local minima, genetic algorithms 'hop' around the 
conformational space independent of local derivatives. A 
selection process focuses the search in low energy areas, 
whereas a recombination stage maintains exploration of 
the search space. 



III. DESIGNING GENETIC ALGORITHMS FOR 
PROTEIN STRUCTURE PREDICTION 

A genetic algorithm is made up of 4 basic components: 
representation; selection; recombination and evaluation. 
Representation deals with formulating the specific prob- 
lem as a digital string of parameters. This, combined 
with the evaluation function (in our case conformational 
energy) describes the optimization problem. The remain- 
ing two components, selection and recombination, pro- 
vide the dynamics of the GA search, which drive the pop- 
ulation of solutions towards the global optimum. Within 
each unit, there are several options leading to numerous 
variations of genetic algorithms; some examples and cor- 
responding parameters are listed in table I. To determine 
guidelines for designing PSP-GAs, we used the following 
principles: 

1 . A first approach should always be the simplest ap- 
proach. In order to analyse the GA dynamics, we 
used a GA with the simplest options and with the 
fewest parameters - a Simple Genetic Algorithm - 
to search lattice conformational space. 



GA Options and Parameters 
Population: static or variable size 

Representation of solutions: bit string, reals, symbolic 
Maximum number of generations or convergence criteria 
Recombination operators: 1-pt crossover, uniform crossover, 
mutation, perturbation 

Recombination probabilities: static, variable or dynamic 
Selection methods: roulette, tournament, rank, elitism 
Fitness scaling: linear with cut-off, quadratic, exponential 

TABLE I. Examples of various options in designing a ge- 
netic algorithm. 



The merit of simple exact lattice models |2§] is the 
ability to test ideas easily and suggest extrapolations to 
real systems. Simple exact models are 'simple' since only 
a few parameters are required, and 'exact' since physi- 
cal properties can be calculated exactly. Lattice models, 
although unrealistic in appearance, provide several ad- 
vantages over real-space models. From a folding dynam- 
ics point of view, they can explore long time behaviour; 
while from a structure prediction/optimization point of 
view, as in our case, they provide a valuable test-bed for 
protein structure optimizers. 

Runs with various initial conditions were carried out to 
ensure that the GA produced similar results each time. 



IV. THE SIMPLE GENETIC ALGORITHM FOR 
LATTICE CONFORMATIONAL SEARCH 

The Simple GA (SGA), as defined by Goldberg ||, 
is the simplest of all genetic algorithms. The original 
SGA manipulated binary strings which encode a trial so- 
lution of the problem at hand. However, a major mod- 
ification for searching protein conformations on a cubic 
lattice is to use a more natural representation for this 
problem. Since a simple cubic lattice is spatially restrict- 
ing, a string of bond directions represents a folded chain 
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of beads; each symbol corresponds to an increment or 
decrement in the appropriate Cartesian coordinate of the 
successive monomer beads (see table . A conformation 
of the polymer is then translated to a set of monomer 
positions r< (i — 1, • • • , N) (where N is the number of 
monomer beads (residues)). 



Direction 



Ar 



(U)P 

(L)EFT 

(F)RONT 

(B)ACK 

(R)IGHT 

(D)OWN 



r z + 1 
r x - 1 
r y + l 
r y -l 
r x + 1 
r z - 1 



TABLE II. Bond directions describing lattice conforma- 
tions. A bond direction corresponds to a change, Ar, in one of 
the Cartesian coordinates of the successive monomer, keeping 
all other coordinates the same as the previous monomer. 



CROSSOVER 



MUTATION 



FIG. 1. The top half represents the action of a one-point 
crossover operator. Below is an illustration of a single ge- 
newise mutation. In both cases the left side represents the 
'before' situation, and the right side, the 'after' state. 



This representation has access to all 6 N lattice confor- 
mations, including all the non-physical, non-self-avoiding 
conformations. Thus the search task is a formidable one. 
A random population of conformations is generated and 
manipulated according to the GA dynamics (selection 
and recombination). The population size, S, is kept 
fixed. All individuals are replaced at each iteration, ex- 
cept for two copies of the current best conformation - this 
is known as 'elitism'. 

Selection is linearly proportional to fitness so that the 
probability, Pi, of selecting the i th conformation, with a 
fitness value Fi, to propagate to the next time step is 
given by: 



Pi = 



(1) 



Probabilities must be positive so a linear mapping with a 
cut-off value is used to convert the energy (E) minimiza- 
tion problem to a fitness (F) maximization: 



Ft = 



-Ei 



if Ei 



<0; 
> 0. 



(2) 



Selected individuals, strings, are modified in a recombi- 
nation process, to generate new solutions. Figure [I] shows 
a schematic representation of the one-point crossover and 
gene-wise mutation operators used by the SGA. These 
operators act stochastically on the selected individuals 
with fixed probabilities. In one-point crossover, a random 
crossover point is chosen for a pair of selected individu- 
als, and the bond directions ('genes') are swapped up to 
the crossover point. Mutations act on a single individual 
and randomly change the value of bond directions along 
the string. The workings of the SGA are summarised in 
figure 0. 



INITIALISE 
POPULATION 
SIZE N 



► POPULATION <- 
EVALUATE 
FITNESS 
SELECT A PAIR 

CROSSOVER? _Y 
PROB Pc 



_^ CROSS AT A 
RANDOM POINT 



MUTATE GENE? 
PROB Pm 



CHANGE ALLELE 
RANDOMLY 



N SELECTED N/2 _» 

PAIRS? 

FIG. 2. The simple genetic algorithm. A population of N 
conformers is initialised. These trial conformations are eval- 
uated and selected, based on their fitness. Pairs of selected 
conformers are crossed over with a probability, P c . Each bond 
direction ('gene') is then randomly mutated with a probabil- 
ity, P m - A new population is selected in this manner and 
repeated for many generations (typically 2000). 
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V. TEST ENERGY FUNCTIONS 

A. Random Energy Model 

Originally used in spin glass theory, the Random En- 
ergy Model (REM) |3l| was applied to protein folding 
by Bryngelson and Wolynes |$2) and later formulated for 
lattice protein models by Shakhnovich and Gutin [p9| . 
In the REM, a protein is described by a fixed sequence 
of random interaction energies. Although not discussed 
here, the random interaction model lends itself for an- 
alytical studies of the coil-globule transition in proteins 
|P9| . Specifically, the conformational energy is calculated 
as: 



N 

E({r 1 ..r N })= B ij A(r i -r j ) + 

JV 
N 



(3) 



-{I 



for i & j nearest neighbours 
otherwise 



if monomers i,j occupy same site 
otherwise 



S(r l - tj) 
where, 

D2 = energetic penalty parameter for sites containing 

2 or more monomers 

D3 = energetic penalty parameter for sites containing 

3 or more monomers 

Bij = disorderly interaction energies 

(energies are in units of fcgT where ks is Boltzmann's 

constant and T is the temperature in Kelvin) 

The interaction matrix, B^, is symmetric but ran- 
domly generated with a Gaussian distribution: 

P(Bij) = (2nB 2 )-i cxp(-(By - B ) 2 /2B 2 ) (4) 

The compactization observed in globule proteins is 
modelled in Eq. (^) using Hy as a negative interaction po- 
tential with mean Bq. B defines the spread, i.e. the stan- 
dard deviation, of the compactization interactions. The 
greater the spread, the more heterogeneous the protein. 
A zero-spread corresponds to uniform interactions; this 
special case constitutes the Fixed Energy Model (FEM) 
which we have used for the larger polymers studied here 
(64mer, 125mer). 

The last two terms in Eq. (^) represent excluded volume 
effects, that is they are energetic penalties for conforma- 
tions with lattice sites occupied by more than two (D2 



term) or more than three (D3 term) monomers. The ob- 
jective of the genetic algorithm is to find conformations 
without multiple occupancies at a single site, by minimis- 
ing this energy function. 



B. HP-model 

It is well known that correlations in the sequence of 
amino acid residues lead to sub-structures (secondary 
structures) common to all protein structures. These cor- 
relations can reduce the size of the 'alphabet' (code) from 
20 symbols to a lesser number. The simplest and most 
interesting is the classification of residues into two types: 
H and P |33| . The energy function for this system favours 
interactions between HH monomer types and is indiffer- 
ent to all other (PP and HP) interactions. This is known 
as the HP-model. It aims to highlight the importance 
of the hydrophobic effect in protein folding. In an aque- 
ous medium, globular proteins tend to have a core of 
hydrophobic residues, surrounded by polar residues on 
the surface; in the HP-model, the H monomers corre- 
spond to hydrophobic residues that collapse to form a 
core surrounded by polar, P, monomers. Much work has 
been carried out by Dill and co-workers on this model 

The energy function for this model is: 



JV 



E({ ri ..T N }) = -\e\ J2 A( ri - Tj ) 

i,j=l;i<3 
JV 

e 2 ^2S(ri -Tj) + 



1,3 

JV 



(5) 



A(r l -r,) = {J 

-{I 



if i,j both H-type & nearest neigh, 
otherwise 

if monomers i,j occupy same site 
otherwise 



where, 

|e| = strength of HH attraction (usually taken as I) 
£2 = energetic penalty parameter for sites containing 
two or more monomers 

£3 = energetic penalty parameter for sites containing 
three or more monomers 

Strictly speaking, the final two terms (ea, £3) in Eq.(|^) 
are not included in the HP-model. We borrow these 
terms from the Random Energy Model to drive the con- 
formational search towards self-avoiding conformations. 
This is necessary since all possible conformations are al- 
lowed; an energetic penalty is required to penalise con- 
formations with multiple occupancies at a single site. For 
both models (REM and HP), the penalty terms reduce 
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to zero if the search is successful and a low energy self- 
avoiding conformation is found. 



VI. RESULTS 

The first test function used was that from the Random 
Energy Model, Eq.(|). Shakhnovich et al. @ used a 
Metropolis Monte Carlo algorithm with this energy func- 
tion to find the native states of 27mer heteropolymers. 
They interpreted the Monte Carlo 'folding' algorithm as 
modelling the folding dynamics of polymers. However, 
although Metropolis Monte Carlo methods asymptoti- 
cally guarantee finding the thermal equilibrium state, it 
is unclear whether an interpretation beyond this has any 
validity. We view their procedure as a protein structure 
optimization approach. These authors enumerated all 
compact (cubic) self-avoiding conformations, which al- 
low them to determine the global minimum of each ran- 
dom 27mer sequence generated. They reported [^7j that 
three out of thirty sequences found the known global 
minimum; energies varied from -83.7 to -74.6 (in units 
of fcsT). Using the SGA, we found that three out of 
four sequences obtained compact, 100% cubic conforma- 
tions (see fig. ||), with energies ranging from -78.1 to 
-69.9 (units in fcsT). We cannot determine whether they 
arc the global energy-minimum structures without carry- 
ing out a full enumeration; this is computationally time 
consuming due to the NP-complctc nature of the prob- 
lem, and, more importantly, unnecessary for our design 
purposes. Furthermore, we were unsuccessful in our cor- 
respondence with the authors and were unable to obtain 
the random interaction matrices specifically used in their 
work ( pj 




FIG. 3. Example of a 100% cubic 27mer conformation 
found by the SGA. 

Longer polymers (64mer, 125mer) are more challeng- 
ing since the conformational space grows exponentially 
with the polymer length. We continued to use the 



REM with B=0. This corresponds to uniform in- 
teractions and thus guarantees that cubic conforma- 
tions (4x4x4, 5x5x5 respectively) occupy the multiply- 
degenerate ground states. The 64mer reached 91% cu- 
bicity and the 125mer runs found a conformation with 
85% cubicity (fig. f|). Since we are restricting our dis- 
cussion to cubic lattices, it is more accurate to describe 
structures according to how 'cubic' they are, rather than 
using the more general term 'compactness'. 

It is unfortunate that 100% cubic conformations were 
not found; however, we are using the simplest genetic 
algorithm. Nevertheless, this initial exercise was use- 
ful to establish optimal values for the GA parameters. 
The optimal GA parameters for short polymers were: a 
minimum population size of 400; a 20% probability of 
crossover and a 4% probability of mutating a bond direc- 
tion. Longer polymers required a larger minimum popu- 
lation size of 1000; a 90% probability of crossover and a 
2% probability of mutation. Having established a good 
set of GA parameters, the SGA was analysed using the 
second test function, the HP-model. 

Unger and Moult studied random HP-sequences of 
length 27 and 64 monomers ^lj. They used two op- 
timization methods: a Metropolis Monte Carlo (MC) 
method and a variation of the Monte Carlo method 
that incorporates a genetic algorithm (GAMC). The 
GAMC method corresponds to a population of Metropo- 
lis Monte Carlo conformers which 'mix' between them- 
selves through a crossover operation. The comparisons 
between these methods and our SGA are shown in tables 



III and IV 



Our SGA beat Monte Carlo in 17 of the 20 test se- 
quences and equalled it in finding low energy conforma- 
tions for two sequences. On average, conformations gen- 
erated by our SGA were 1.1 energy units lower than those 
found by the Monte Carlo method for the 27mers and 8.3 
units lower for the 64mers. 

When compared to the GAMC method, the SGA per- 
formed on average as well as the GAMC for the 27mers; 
SGA conformational energies were on average 0.5 unit 
higher. In the 64mer case, SGA conformations were on 
average 3.4 energy units higher than the conformational 
energies found by the GAMC. There was one 64mer se- 
quence for which the SGA found a lower energy confor- 
mation than the GAMC. An important note is the speed 
at which the SGA found low energy conformations. Only 
a fraction of the total number of steps were required for 
the SGA when compared to the Monte Carlo and GA- 
Monte Carlo. For example, for the 27mer, to reach a solu- 
tion of comparable quality, the number of steps required 
by the SGA was 3% of the number of steps required by 
the MC method, and 4% for the GAMC. 

Further studies were carried out using HP-sequences 
taken from Yue et al. pq| . They designed ten 48mer se- 
quences and determined the native conformational states 
using a constraint-based hydrophobic core construction 
method. This method determines the global minima 
of HP-sequences by constructing conformations with a 
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FIG. 4. Left: 91% cubic conformation for a 64mer homopolymer. Right: 85% cubic conformation for a 125mer homopolymer. 



sequence 


EsGA 


Num Steps 


AEmc 


AEqamc 


273d.l 


-8 


2.9E+04 


-1 


1 


273d.2 


-8 


2.1E+04 


1 


1 


273d.3 


-8 


1.1E+05 


-2 





273d.4 


-15 


2.7E+05 


-4 





273d.5 


-7 


9.6E+03 





1 


273d.6 


-11 


9.6E+04 


-2 





273d.7 


-11 


2.2E+04 


-1 


1 


273d.8 


-4 


8.6E+04 








273d.9 


-7 


6.0E+04 


-1 





273d.l0 


-10 


5.6E+04 


-1 


1 


AVERAGE 


-8.9 


7.6E+04 


-1.1 


0.5 



TABLE III. SGA comparisons with Unger and Moult's 
27mer results. The sequence number corresponds to the num- 
ber used by Unger and Moult to label their HP sequences. 
Esga is the lowest energy found by the SGA. The 'num 
steps' column reports the number of energy evaluations car- 
ried out by the SGA to reach the lowest energy state. AEmc 
is the energy difference between the lowest energies found 
by the SGA and Unger & Moult's Monte Carlo procedure: 
AEmc = E S ga - E M c- Similarly, AE G amc is the energy 
difference between the SGA and Unger & Moult's GAMC 
method. 



sequence 


Esga 


Num Steps 


AEmc 


AEgamc 


643d.l 


-21 


6.4E+05 


-9 


6 


643d. 2 


-26 


1.5E+06 


-9 


3 


643d.3 


-36 


1.7E+06 


-12 


-1 


643d.4 


-30 


1.9E+06 


-12 


4 


643d.5 


-28 


6.3E+05 


-8 


4 


643d.6 


-22 


4.6E+05 


-6 


7 


643d. 7 


-17 


2.1E+05 


-2 


3 


643d.8 


-28 


1.4E+06 


-9 


1 


643d.9 


-29 


9.8E+05 


-10 


3 


643d. 10 


-20 


3.1E+05 


-6 


4 


AVERAGE 


-25.7 


9.8E+05 


-8.3 


3.4 



TABLE IV. SGA comparisons with Unger and Moult's 
64mer results. The sequence number corresponds to the num- 
ber used by Unger and Moult to label their HP sequences. 
Esga is the lowest energy found by the SGA. The 'num 
steps' column reports the number of energy evaluations car- 
ried out by the SGA to reach the lowest energy state. AEmc 
is the energy difference between the lowest energies found 
by the SGA and Unger & Moult's Monte Carlo procedure: 
AEmc = Esga - E M c- Similarly, AEqamc is the energy 
difference between the SGA and Unger & Moult's GAMC 
method. 
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core of H (hydrophobic) residues that also minimize the 
surface area of the conformation. The difference in the 
conformational energies found by the SGA and the na- 
tive state energy is labelled as AE^. Yue et al. also 
used a conformational search algorithm for HP-sequences 
called a 'hydrophobic zipper' (HZ). In this case, the H 
monomers attract nearby H monomers and bring them 
together in a process akin to nucleation. The energy dif- 
ference between conformations found by this method and 
the SGA is denoted as AEhz ■ The comparisons are sum- 
marised in table |v|. Conformations found by the SGA 
were on average 6.0 energy units higher. An example of 
a compact conformation by the SGA is shown in figure 
I). In no cases did the SGA equal or beat conformational 
energies obtained by the hydrophobic zipper method. 



number of nearest neighbours=39 




FIG. 5. Example of a compact HP conformation, with a 
hydrophobic core, found by the SGA. 

Runs took from several minutes for short polymers 
(27mers, 48mers) to hours for longer cases (64mers, 
I25mcrs) on a DEC 3000 Alpha workstation. 

VII. DISCUSSION 

The conformations found by the SGA are not promis- 
ing in themselves. However, the aim of the study is to 
analyse the simple genetic algorithm and determine what 
factors are important in designing PSP-GAs. What can 
we learn from these results about PSP-GAs? 

It is clear from the 27mer studies that the SGA per- 
forms well for short polymers and tends to find compact, 



sequence 


EsGA 


#Steps 


AEhz 


AEn 


483d.81 


-24 


1.6E+06 


7 


8 


483d.82 


-24 


4.7E+05 


8 


10 


483d.83 


-23 


1.9E+05 


7 


10 


483d.84 


-24 


8.6E+05 


7 


10 


483d.85 


-28 


2.4E+05 


2 


4 


483d.86 


-25 


5.5E+05 


4 


7 


483d.87 


-27 


3.8E+05 


2 


5 


483d.88 


-26 


4.2E+05 


3 


5 


483d.89 


-27 


1.3E+05 


4 


7 


483d.90 


-26 


7.0E+04 


3 


7 


AVERAGE 


-25.4 


4.9E+05 


9.4 


7.3 



TABLE V. SGA comparisons with Yue and Dill's 48mer 
studies. Sequence number corresponds to the number used 
by Yue et al.to label their HP sequences. Esga is the lowest 
energy found by the SGA. The 'num steps' column reports the 
number of energy evaluations carried out by the SGA to reach 
the lowest energy state. AEhz is the energy difference be- 
tween the lowest energies found by the SGA and Yue et al.'s 
hydrophobic zipper method: AEhz = Esga — Ehz- Sim- 
ilarly, AEn is the energy difference between the SGA and 
the native state energy found by Yue et al. using the con- 
straint-based hydrophobic core construction method. 

low energy structures. However, as was evident in our ini- 
tial investigation of 64mers and I25mers using the REM, 
the SGA performs below average. Although the SGA 
method was better than the Monte Carlo method, it did 
not perform as well as the GAMC and was even worse 
than the HZ algorithm. To some extent, it is not sur- 
prising that the hydrophobic zipper method performed 
better than the SGA. HZ is a specialised search algo- 
rithm for the HP-model and generates conformations by 
explicitly forming H-H contacts. It is not uncommon for 
specialised algorithms to solve certain cases of an NP- 
complcte problem - however, a general solution remains 
difficult g. 

In general the SGA finds it hard to generate compact 
conformations for the longer polymers - why? We in- 
vestigated this further using the HP-model. The best 
way to analyse the dynamics of the SGA is to observe its 
time evolution rather than merely the end result. Thus, 
the lowest energy conformations were observed at various 
time points. In several cases, the SGA produced two clus- 
ters of residues with hydrophobic (H) cores connected by 
a 'thread' - similar to loop regions in real proteins. It is 
promising that such conformations (see figure |^) are pos- 
sible since proteins with sub-structures connected by loop 
regions are common in Nature. However, this is not the 
lowest energy conformation for the HP-model, so we con- 
clude that the clustering is due to restrictions imposed by 
the GA dynamics, that is to selection and recombination. 
Selection is an important aspect of the GA dynamics but 
we do not think it is the reason for the occurrence of clus- 
tered solutions; if a lower energy globule existed in the 
population, the selection function would have assigned it 
a high probability and elitism would guarantee its selec- 
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tion into the next generation. 

The problem must therefore lie in the recombination 
stage: it appears that the one-point crossover and mu- 
tation operations cannot inter-digitate the clusters suc- 
cessfully. We believe that these recombination operators 
would fold the clusters into each other creating many 
sites with multiple occupancies; these are penalised by 
the excluded volume terms in Eq.^ heavily reducing the 
'survival probability' of such structures. The difficulty 
appears to be in the formation of only two large clus- 
ters, or 'building blocks'; we think that more clusters 
but smaller in size would be easier to manipulate. How 
do we promote this in the GA dynamics? 

Genetic algorithms manipulate partial solutions in 
their search for the overall optimal solution |]l^ , p4| . 
These partial solutions or 'building blocks' correspond 
to sub-strings of a trial solution - in our case local sub- 
structures within the overall conformation. Clearly, the 
level of crossover influences the size of the building blocks. 
A multi-point crossover would generate smaller building 
blocks (a higher 'resolution') and consequently smaller 
clusters that should facilitate their inter-digitation. In 
the case of the 27mer, the one-point crossover generates 
building blocks of 13.5 monomers, on average. Based 
on this argument, 48mers and 64mers require at least 
two to three point crossovers to reduce the size of the 
building blocks to ~ 10. Furthermore, the local pertur- 
bation dynamics of the Monte Carlo methods aids the 
local 'fine tuning' of conformations as manifested by the 
GAMC method. It is believed that the two recombina- 
tion operators of multi-point crossover and a local pertur- 
bation are required for any PSP-GA to be fully effective. 
Work is under way to design a PSP-GA to optimize real 
protein structures (rather than lattice models) which in- 
cludes these requirements. 

One further problem with genetic algorithms is 'know- 
ing when to stop'. Most optimization algorithms deal 
with a single solution at a time and decide to stop when 
there has been no change in the cost or energy function 
for a successive number of steps. Since GAs deal with an 
ensemble of solutions, a quantity analogous to the statis- 
tical mechanical free energy is used. The 'population free 
energy', F, is calculated from its 'partition function', Z: 



N 



z = /2< 



-Ei 



(6) 



where the sum is over the total number of conformers in 



a population and Ei is the energy of the 
Hence, 



;lh 



F 



HZ) 



conformer. 



(7) 



This approach is advantageous over using the mean 
energy of the population in two ways. Firstly, the mean 
energy fluctuates around the equilibrium making it diffi- 
cult to use as a stop criterion, and secondly, F contains 
more information on the shape of the energy distribution, 



including, in particular, its 'entropy'. F appears to play 
the role of a Lyapounov function in the GA dynamics - 
convergence is synonymous with its minimal value. Fig- 
ure |?] plots the energy distributions of the population at 
various time steps. The plot shows very clearly that the 
GA dynamics converge the population of conformations 
to an equilibrium distribution. This is characterised by 
F as shown in figure ||. 

Evolution of energy distributions 
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FIG. 7. Evolution of a population: plotting the energy dis- 
tributions at various time steps, t is the percentage of the 
total run-time elapsed. The population converges within 50% 
of the run time. 



Mean, minimum and free energies of an evolving population 
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FIG. 8. Evolution of the energies of a population. The 
free energy, which characterises the energy distribution of a 
population, reaches a convergence, or equilibrium point. 

In conclusion, the genetic algorithm approach to the 
protein structure prediction problem offers a promising 
potential method of solution. GAs are fast and effi- 
cient at searching the rugged conformational landscapes 
presented by protein molecules. We have established 
some guidelines for designing PSP-GAs in this paper and 
are currently implementing them in an improved GA to 
search for realistic protein structures. 
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1. nearest=7 energy=-7.000 



2. nearest=24 energy=-14.000 



3. nearest=27 energy=-16.000 



4. nearest=30 energy=-17.000 



< • 



FIG. 6. Evolution of the best conformation per generation: formation of clusters in the Simple Genetic Algorithm. 
Dark=H=hydrophobic, Light=P=polar. 
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